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Abstract 

Recent work has shown how information theory extends conventional full-rationality game 
theory to allow bounded rational agents. The associated mathematical framework can be used to 
solve constrained optimization problems. This is done by translating the problem into an iterated 
game, where each agent controls a different variable of the problem, so that the joint probability 
distribution across the agents’ moves gives an expected value of the objective function. The 
dynamics of the agents is designed to minimize a Lagrangian function of that joint distribution. 
Here we illustrate how the updating of the Lagrange parameters in the Lagrangian is a form of 
automated annealing, which focuses the joint distribution more and more tightly about the joint 
moves that optimize the objective function. We then investigate the use of “semicoordinate” 
variable transformations. These separate the joint state of the agents from the variables of the 
optimization problem, with the two connected by an onto mapping. We present experiments 
illustrating the ability of such transformations to facilitate optimization. We focus on the special 
kind of transformation in which the statistically independent states of the agents induces a 
mixture distribution over the optimization variables. Computer experiment illustrate this for 
fc-sat constraint satisfaction problems and for unconstrained minimization of NK functions. 

Subject Classification: programming: nonlinear, algorithms, theory; probability: applications 
Area of Review: optimization 
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1 Introduction 


1.1 Distributed optimization and control with Probability Collectives 

As first described in (Wolpert 2003a, Wolpert 2004a), it turns out that one can translate many of 
the concepts from statistical physics, game theory, distributed optimization and distributed control 
into one another. This translation is based on the fact that those concepts all involve distributed 
systems in which the random variables are, at any single instant, statistically independent. (What 
is coupled is instead the distributions of those variables.) Using this translation, one can transfer 
theory and techniques between those fields, creating a large common mathematics that connects 
them. This common mathematics is known as Probability Collectives (PC). Its unifying concern 
is the set of probability distributions that govern any particular distributed system, and how to 
manipulate those distributions to optimize one or more objective functions. See (Wolpert, Turner 
& Bandari 2003, Wolpert & Turner 2001) for earlier, less formal work on this topic. 

In this paper we consider the use of PC to solve constrained optimization and/or control prob- 
lems. Reflecting the focus of PC on distributed systems, its use for such problems is particularly 
appropriate when the variables in the collective are spread across many physically separated agents 
with limited inter-agent communication (e.g., in a distributed design or supply chain application, 
or distributed control). A general advantage of PC for such problems is that since they work with 
probabilities rather than the underlying variables, they can be implemented for arbitrary types of 
the underlying variables. This same characteristic also means they provides multiple solutions, each 
of which is robust, along with sensitivity information concerning those solutions. An advantage 
particulary relevant to optimization is that the distributed PC algorithm can often be implemented 
on a parallel computer. An advantage particularly relevant to control problems is that PC algo- 
rithms can, if desired, be used without any modelling assumptions about the (stochastic) system 
being controlled. These advantages are discussed in more detail below. 

1.2 The Probability Collectives Approach 

Broadly speaking, the PC approach to optimization/control is as follows. First one maps the 
provided problem into a multi-agent collective. In the simplest version of this process one assigns 
a separate agent of the collective to determine the value of each of the variables Xi € A) in the 
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problem that we control. So for example if the i’th variable can only take on a finite number 
of values, those \X ^ | possible values constitute the possible moves of the i’th agent . 1 The value 
of the joint set of n variables (agents) describing the system is then x = [aq, ■ *-• ,x n ] € X with 
X = X x x ■ • • x X n ? 

Unlike many optimization methods, in PC the variables are not manipulated directly. Rather a 
probability distribution is what is manipulated. To avoid combinatorial explosions as the number 
of dimensions of X grows, we must restrict attention to a low-dimensional subset of the space of 
all probability distributions. We indicate this by writing our distributions as q G Q over X. The 
manipulation of that q proceeds through an iterative process. The ultimate goal of this process is 
to induce a distribution that is highly peaked about the x optimizing the objective function G(x), 
sometimes called the world cost or world utility function. (In this paper we only consider problems 
with a single overall objective function, and we arbitrarily choose lower values to be better, even 
when using the term “utility”.) 

In the precise algorithms investigated here, at the start of any iteration a single Lagrangian 
function of q, C : Q — > M, is specified, based on G(x) and the associated constraints of the 
optimization problem. Rather than minimize the objective function over the space X, the algorithm 
minimizes that Lagrangian over q E Q. This is done by direct manipulation of the components of 
q by the agents. 

After such a minimization of a Lagrangian, one modifies the Lagrangian slightly. This is done 
so that the q optimizing the new Lagrangian is more tightly concentrated about x that solve our 
optimization problem than is the current q. One then uses the current q as the starting point for 
another process of having the agents minimize a Lagrangian, this time having them work on that 
new Lagrangian. 

At the end of a sequence of such iterations one ends up with a final q. That q is then used to 
determine a final answer in X, e.g., by sampling q, evaluating its mode, evaluating its mean (if that 
is defined), etc. For a properly chosen sequence of Lagrangians and algorithm for minimizing the 
Lagrangians, this last step should, with high probability, provide the desired optimal point in X. 

For the class of Lagrangians used in this paper, the sequence of minimizations of Lagrangians is 
closely related to simulated annealing. The difference is that in simulated annealing an inefficient 
Metropolis sampling process is used to implicitly descend each iteration’s Lagrangian. By explicitly 


3 



manipulating q, PC allows for more efficient descent. 

In this paper we shall consider the case where Q is a product space, q(x) = fl,; Qi( x i)- The 
associated formulation of PC is sometimes called “Product Distribution” theory. It corresponds 
to noncooperative game theory, with each q . j being agent Ps “mixed strategy” (Wolpert 2004a, 
Fudenberg Sz Tirole 1991). Our particular focus is the use of such product distributions when A 
is not the same as the ultimate space of the optimization variables, Z. In this formulation — a 
modification of what was presented above — there is an intermediate mapping from X — > Z, and 
the provided G is actually a function over Z , not (directly) over X . Such intermediate mappings are 
called semicoordinate systems, and going from one to another is a semicoordinate transformation. 
As elaborated below, such transformations allow arbitrary coupling among the variables in Z while 
preserving many of the computational advantages of using product distributions over X . 

1.3 Advantages of Probability Collectives 

There are many advantages to working with distribution in Q rather than points in X . Usually 
the support of q is all of X, i.e., the q minimizing the Lagrangian lies in the interior of the unit 
simplices giving Q. Conversely, any element of X can be viewed as a probability distribution on the 
edge (a vertex) of those simplices. So working with A is a special case of working with Q, where 
one sticks to the vertices of Q. In this, optimizing over Q rather than X is analogous to interior 
point methods. Due to the breadth of the support of q, minimizing over it can also be viewed as 
a way to allow information from the value of the objective function at all x £ X to be exploited 
simultaneously. 

Another advantage, alluded to above, is that by working with distributions Q rather than 
the space X, the same general PC approach can be used for essentially any X, be it continuous, 
discrete, time-extended, mixtures of these, etc. (Formally, those different spaces just correspond 
to different probability measures, as far as PC is concerned.) For expository simplicity though, 
here we will work with finite X, and therefore have probability distributions rather than density 
functions, sums rather than integrals, etc. See in particular (Bieniawski &; Wolpert 2004a, Wolpert 
& Bieniawski 2004a, Wolpert 20046, Wolpert 2004c) for analysis explicitly for the case of infinite 
A. 

Yet another advantage arises from the fact that even when A is finite, g £ Q is a vector in 
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a Euclidean space. Accordingly the Lagrangian we are minimizing is a real-valued function of a 
Euclidean vector. This means PC allows us to leverage the power of descent schemes for continuous 
spaces like gradient descent or Newton’s method — even if A is a categorical, finite space. So with 
PC, schemes like “gradient descent for categorical variables” are perfectly well-defined. 

While the Lagrangians can be based on prior knowledge or modelling assumptions concerning 
the problem, they need not be. Nor does optimization of a Lagrangian require control of all variables 
X (i.e. , some of the variables can be noisy). This allows PC to be very broadly applicable. 

1.4 Connection with other sciences 

A more general advantage of PC is how it relates seemingly disparate disciplines to one another. In 
particular, it can be motivated by using information theory to relate bounded rational game theory 
to statistical physics (Wolpert 2003a, Wolpert 2004a). This allows techniques from one field to 
be imported into the other field. For example, as illustrated below, the grand canonical ensemble 
of physics can be imported into noncooperative game theory to analyze games having stochastic 
numbers of the players of various types. 

To review, a noncooperative game consists of a sequence of stages. At the beginning of each 
stage every agent (aka “player”) sets a probability distribution (its “mixed strategy”) over its 
moves (Fudenberg & Tirole 1991, Aumann & Hart 1992, Basar &: Olsder 1999, Fudenberg &; Levine 
1998). The joint move at the stage is then formed by agents simultaneously sampling their mixed 
strategies at that stage. So the moves those agents make at any particular stage of the game 
are statistically independent and the distribution of the joint-moves at any stage is a product 
distribution — just like in PD theory. 

This does not mean that the moves of the agents across all time are statistically independent 
however. At each stage of the game each agent will set its mixed strategy based on information 
gleaned from preceding stages, information that in general will reflect the earlier moves of the other 
agents. So the agents are coupled indirectly, across time, via the updating of the at the end 

of each stage. 

Analogously, consider again the iterative PD algorithm outlined above, and in particular the 
process of optimizing the Lagrangian within some particular single iteration. Typically that process 
proceeds by successively modifying q across a sequence of timesteps. In each of those timesteps 
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q(x) = n iQi( x i) is first sampled, and then it is updated based on all previous samples. So just 
like in a noncooperative game there is no direct coupling of the values of the underlying variables 
{xi } at any particular timestep (q is a product distribution). Rather just like in a noncooperative 
game, the variables are indirectly coupled, across time (i.e. , across timesteps of the optimization), 
via coupling of the distributions qi(xi) at different timesteps. 

In addition, information theory can be used to show that the bounded rational equilibrium of any 
noncooperative game is the q optimizing an associated “maxent Lagrangian” C(q) (Wolpert 2004a). 
(That Lagrangian is minimized by the distribution that has maximal entropy while being consistent 
with specified values of the average payoffs of the agents.) This Lagrangian turns out to be exactly 
the one that arises in the version of PC considered in this paper. So bounded rational game theory 
is an instance of PC. 

Now in statistical physics often one wishes to find the distribution out of an allowed set of 
distributions (e.g., Q ) with minimal distance to a fixed target distribution p € V, the space of all 
possible distributions over X. Perhaps the most popular choice for a distance measure between 
distributions is the Kullback-Leibler (KL) distance 3 : D(q\\p) = ]C X q(x) ln(c/(x)/p(x)) (Cover & 
Thomas 1991). As the KL distance is not symmetric in its arguments p and q we shall refer to 
D(q\\p) as the qp KL distance (this is also sometimes called the exclusive KL distance), and D(p\\q) 
as the pq distance (also sometimes called the inclusive KL distance). 

Typically in physics p is given by one of the statistical “ensembles”. An important exam- 
ple of such KL minimization arises with the Boltzmann distribution of the canonical ensemble: 
p(x) oc exp[— Lf(x)/T], where H is the “Hamiltonian” of the system. The KL distance D(q\\p) 
to the Boltzmann distribution is proportional to the Gibbs free energy of statistical physics. This 
free energy turns out to be identical to the maxent Lagrangian considered in this paper. Stated 
differently, if one solves for the distribution q from one’s set that minimizes qp KL distance to the 
Boltzmann distribution, one gets the distribution from one’s set having maximal entropy, subject 
to the constraint of having a specified expected value of H. When the set of distributions one’s 
considering is Q, the set of product distributions, this q minimizing qp KL distance to p is called 
a “mean- field approximation” to p. So mean- field theory is an instance of PC. 

This illustrates that bounded rational games and the mean-field approximation to Boltzmann 
distributions are essentially identical. To relate them one equates H with a common payoff function 
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G. The equivalence is completed by then identifying each (independent) agent with a different one 
of the (independent) physical variables in the argument of the Hamiltonian. 

This connection between these fields allows us to exploit techniques from statistical physics 
in bounded rational game theory. For example, as mentioned above, rather than the canonical 
ensemble, we can apply the grand canonical ensemble to bounded rational games. This allows us 
to consider games in which the number of players of each type is stochastic (Wolpert 2004a). 

1.5 The contribution of this paper 

Use of a product distribution space Q for optimization/control is consistent with game theory (and 
more generally multi-agent systems). This choice also results in a highly parallel algorithm, and is 
well-suited to control problems that are inherently distributed. Nonetheless, other concerns may 
dictate different Q. In particular, in many optimization tasks we seek multiple solutions far apart 
from one another. For example, in Constraint Satisfaction Problems (CSPs) (Dechter 2003), the 
goal is to identify all feasible solutions which satisfy a set of constraints, or to show that none exist. 
Typically when there are multiple feasible solutions they are vary far from one another. For small 
problem instances exhaustive enumeration techniques like branch-and-bound are typically used to 
identify all such feasible solutions; we are interested in larger problems. 

In cases like these, where we desire multiple far-apart solutions, use of PC with a product 
distribution may be a poor choice. The problem is that if each distribution qi is peaked about 
every value of Xi which occurs in at least one of the multiple solutions, then in general there will 
be spurious peaks in the product g(x) = be-, ^( x ) ma Y be peaked about some x that are 

not solutions. On the other hand the alternative scenario, where each q t is only peaked about a 
few of the solutions, does not provide us the desired many solutions. To address this we might 
descend the Lagrangian many times, beginning from different starting points (i.e. , different initial 
q). However there is no guarantee that multiple runs will each generate different solutions. 

PC offers a simple solution to this problem that allows one to still use product distributions: 
extend the event space underlying our product distributions so that a single game provides mul- 
tiple distinct solutions to our optimization problem at once. Formally, this is a semicoordinate 
transformation. Intuitively speaking, the transformation considered here recasts the problem in 
terms of a “meta-game” by cloning the original game into several simultaneous games, with an 
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independent set of agents for each game. We also have a supervisor agent who chooses what game 
is to be played. We then form a Lagrangian for the meta-game that is biased towards having any 
agents that control the same variable in different games have different mixed strategies from one 
another. The joint strategies for each of the separate games in the meta-game then give us our set 
of multiple solutions to the original game. The supervisor agent sets the probability distribution of 
which such solution is used. Since in general the resultant distribution across the variables being 
optimized (i.e., across Z) cannot be written as a single product distribution, it provides coupling 
among those variables. 

More precisely, recall that the space of arguments to our objective function is Z , and our 
product distributions are instead over X , with the “semicoordinate system” being the map from 
this space to Z (Wolpert &; Bieniawski 2004a, Wolpert 2004d). Before transformation of the 
semicoordinate system, X = Z, and product distributions over X cannot give coupled distributions 
over Z. However we will change X from Z and change the semicoordinate system in an associated 
way. We then consider product distributions over the new X (i.e., the noncooperative game is 
played in X, not Z). By appropriate choice of the semicoordinate transformation, such distributions 
corresponds to coupled distributions across Z. In general any Bayes net topology can be achieved 
with an appropriate semicoordinate transformation (Wolpert 2004d, Wolpert & Bieniawski 2004a). 
Different product distributions over Z correspond to different Bayes nets having that same topology. 

Here we consider a X that results in a mixture of M product distributions Z , (Macready & 
Wolpert 20046) 

M 

<?( z ) = q°{m)q m ( z). 

m= 1 

Intuitively, q° is the distribution over the moves of the supervisor agent, with m being the game 
that agent chooses. This allows for the determination of M solutions at once. At the same time, 
due to the entropy term in the Lagrangian, it “pushes” the separate products q m ( z) in the mixture 
apart. This biases the algorithm to trying to find separated solutions, as desired. 

In Sec. 2 we review how one arrives at the Lagrangian considered in this paper, the maxent 
Lagrangian. Then in Sec. 3 we review two elementary techniques introduced in (Wolpert & 
Bieniawski 20046, Wolpert 2003a, Wolpert 20046) for updating a product distribution q to minimize 
the associated Lagrangian. In the experiments reported below, all terms in those update rules can 



be calculated in closed form. This is not true in general though. In Sec. 4 we review a set of 
Monte-Carlo techniques for addressing such general scenarios. 

We review semicoordinate transformations in Sec. 5, with particular attention for how mixture 
models may be seen as a product distributions over a different space. In Sec. 6 we analyze the 
minimization of the maxent Lagrangian associated with mixture-inducing semicoordinate transfor- 
mations. In that section we also relate our maxent Lagrangian for mixture distributions to the 
Jensen-Shannon distance over X. Experimental validation of these techniques is then presented 
for the ^-satisfiability CSP problem (section 7.1) and the NK (section 7.2) optimization problems. 
These sections consider both situations where the semicoordinate transformation is fixed upfront 
and those where it is found dynamically. 

We end with a synopsis of some other techniques for updating a product distribution q to 
minimize the associated Lagrangian. This synopsis serves as the basis for a discussion of the 
relationship between PC and other techniques. The distinguishing feature of PC is that it does not 
treat the variable x as the fundamental object to be optimized, but rather the distribution across 
it, q. Furthermore, samples of that distribution are only used if necessary to estimate quantities 
that cannot be evaluated other ways. The fundamental objective function is stated in terms of q. 

It should be emphasized that like all of PC, the techniques presented in this paper can readily be 
applied to many problems other than constrained optimization. For example, PC provides a natural 
improvement to the Metropolis sampling algorithm (Wolpert & Lee 2004), which the techniques 
of this paper should be able to improve further. See (Antoine, Bieniawski, Kroo & Wolpert 2004, 
Wolpert & Bieniawski 2004a, Bieniawski, Wolpert & Kroo 2004, Bieniawski & Wolpert 20046) for 
other examples and experiments. 

2 The Lagrangian for Product Distributions 

We begin by considering the case of the identity semicoordinate system, X = Z. As discussed above, 
we consider qp distance to the T-parameterized Boltzmann distribution p(x) = exp[— G(x) / T]/Z(T ) 
where Z(T ) is a normalization constant. At low T the Boltzmann distribution is concentrated on 
x having low G values, so that the product distribution with minimal qp distance to it would be 
expected to have the same behavior. Accordingly, one would expect that by taking qp KL distance 
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to this distribution as one’s Lagrangian, and modifying the Lagrangian from one iteration to the 
next by lowering T, one should end up at a q concentrated on x having low G values. (See (Wolpert 
& Bieniawski 20046, Wolpert 20046, Wolpert 2004c) for a more detailed formal justification of using 
this Lagrangian based on solving constrained optimization problems with Lagrange parameters.) 

More precisely, the qp KL distance to the Boltzmann distribution is the maxent Lagrangian, 

jC(q)=E g (G)-TS(q) (1) 

up to irrelevant additive and multiplicative constants. Equivalently, we can write it as 

C(q) = tm q (G) - S{q) (2) 

where (5 = 1/T, up to an irrelevant overall constant. In these equations the inner product E 9 (G) = 
^ x g(x)G(x) is the expected value of G under q, and S(q) = — q(x) lng(x) is the Shannon 
entropy of q. 

For q's which are product distributions S(q) = JA<S(<fe) where S(qi) = — ^ qi(%i) Inqi(xi). 
Accordingly, we can view the maxent Lagrangian as equivalent to a set of Lagrangians, C t (q) = 
Y^ x . [G(xi,X-i)\qi(xi)—TSi(qi), one such Lagrangian for each agent i so that C(q) = YH=i A(<?)- 5 
The first term in C is minimized by having perfectly rational players, i.e. by players who concen- 
trate all their probability on the moves that are best for them, given the distributions over the 
agents. The second term is minimized by perfectly irrational players, i.e., by a perfectly uniform 
joint mixed strategy q. So T specifies the balance between the rational and irrational behavior of 
the players. In particular, for T — > 0, by minimizing the Lagrangian we recover the Nash equilibria 
of the game. Alternatively, from a statistical physics perspective, where T is the temperature of 
the system, this maxent Lagrangian is simply the Gibbs free energy for the Hamiltonian G. 

Since we are interested in problems with constraints, we replace G in Eqs. (1) and (2) with 

C 

G(x) + ^ A a c a (x) (3) 

a — 1 

where G is the original objective function and the c a are the set of C equality constraint functions 
that are required to be equal to zero. The \ a are the Lagrange multipliers that are used to enforce 
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the constraints. Collectively, we refer to the Lagrange multipliers with the C-vector A. 

In CSP’s we take the original objective function to be the constant function 0. In addition, 
the constraints are all equality constraints, so a saddle point of the Lagrangian over the space of 
possible q and A is a solution of our problem. Note though that we don’t have to find the exact 
saddle point; in general sampling from a q close to the saddle point will give us the x’s we seek. 

An alternative approach to incorporating constraints would start by weakening them so that 
they can be violated. We would then iteratively anneal down those weaknesses, i.e., strengthen the 
constraints, to where they are not violated. In this approach we replace the maxent Lagrangian 
formulation encapsulated in Eq.’s (2) and (3) with 

C(q, f3, A) = (3[E q (G) - 7G ]+E A «[ E ^) - 7a] - S(q). (4) 

a 

In each iteration of the algorithm (3, A are treated as Lagrange parameters and one solves for their 
values that enforce the equality constraints E g (G) = 7 g, and the C constraints E g (c a ) = 7a while 
also minimizing C(q,/3, A). In the usual way, since our constraints are all equalities, one can do 
this by finding saddle points of C(q,f3, A). The next iteration would then start by modifying our 
Lagrangian by shrinking the values 7 c, { 7a } slightly before proceeding to a new process of finding 
a saddle point. 

For pedagogical simplicity, here we do not consider this alternative approach, but concentrate 
on the Lagrangian of Eq. (1) with the G of Eq. (3). Note that the vectors {q{\ must be probability 
distributions. So there are implicit constraints our solution must satisfy: 0 < qi(xi ) < 1 for all i 
and Xi, and Yh Xi Qi( x i) = 1 f° r all *• To reduce the size of our equations we do not explicitly write 
those constraints. 

3 Minimizing the maxent Lagrangian 

For fixed /3, our task is to find a saddle point of C(q, A). In “first order methods” such a saddle 
point is found by iterating a two-step process. In the first step the Lagrange parameters A are 
fixed and one solves for the q that minimizes the associated £. 6 In the second step one then freezes 
that q and updates the Lagrange parameters. There are more sophisticated ways of finding saddle 
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points (Grantham 2004), and more generally one can use modified versions of the Lagrangian (e.g., 
an augmented Lagrangian (Bertsekas 1996)). Here for pedagogical simplicity we do not consider 
such more sophisticated approaches. 

In this section we review two approaches to finding the {(/,;} for fixed Lagrange multipliers A. 
We also describe our approach for the second step of the first order method, i.e., we describe how we 
use gradient ascent to update the Lagrange multipliers X a for fixed q. See (Wolpert 2003a, Wolpert 
& Bieniawski 20046, Wolpert 20046) for further discussion of these approaches as well as the many 
others one can use. 

3.1 Brouwer Updating 

At each step t the direction in the simplex Q that, to first order, maximizes the drop in C is given 
by (-1 times) 

y q C{q) ± V q C{q) - r,(q). (5) 

In this equation the qi(xi) component of the gradient (one for every agent i and every possible 
move Xi by the agent) is 

dC 

[V q C{q)]q i{xi ) = dq .^ x) = Eg-i(G|xi) +Tln[q i (x i )\ (6) 

where 

E q _i{G\xi) = ^g_j(x_j)G(xj,x_j) 

X-i 

with x_j = [x i , • * * , Xj_i,Xj+i, • • • ,x n \ and g_j(x_i) = IT/=i|j/* Qj( x j)- ? ?(<?) is the vector that 
needs to be added to W q C(q) to get it back into Q.‘ The qi(xi ) component of ??((/), is equal to 

fo(g)]«(xo = m(q) = rw Z) ( 7 ) 

x i 

where |A)| is the number of possible moves x t . Not that for any agent i, all of the associated 
components of r](q), namely qi(x\), ■ ■ ■ , %(xi^i), share the same value r)i(q). This choice ensures 
that ^2 X . qi(xi) = 1 after the gradient update to the values qi(xi). 

The expression in Eq. (3.1) is the expected payoff to agent i when it plays move Xj, under the 
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distribution across the moves of all other agents. Setting V q C(q) to zero gives the solution 

Qi +1 ( x i) oc exp[-E q t_{G\xi) /T] (8) 

Brouwer’s fixed point theorem guarantees the solution of Eq. (8) exists for any G (Wolpert 2004a, 
Wolpert 2003a). Hence we call update rules based on this equation Brouwer updating. 

Brouwer updating can be done in parallel on all the agents. One problem that can arise here 
is “thrashing”. Each agent i is adopting the qi that would be optimal if all the other agents 
didn’t change their distributions. However they do change their distributions, and thereby at least 
partially confound agent i. One way to address this problem is to have agent i not use the current 
value E t {G\xf) alone to update qj(xi), but rather use a weighted average of all values E t < (G\xi) 

■* 1 — i 

for t' <t, with the weights shrinking the further into the past one goes. This introduces an inertia 
effect which helps to stabilize the updating. (Indeed, in the continuum-time limit, this weighting 
becomes the replicator dynamics (Wolpert 2004(f).) 

A similar idea is to have agent i use the current E q t ( G\xi ) alone, but have it only move part of 
the way the parallel Brouwer update recommends. Whether one moves all the way or only part-way, 
what agent i is interested in is what distribution will be optimal for the next distributions of the 
other agents. Accordingly, it makes sense to have agent i predict, using standard time-series tools, 
what those future distributions will be. This amounts to predicting what the next vector of values 
of E i ( Glxi ) will be, based on seeing how that vector has evolved in the recent past. See (Shamma 
& Arslan 2004) for related ideas. 

Another way of circumventing thrashing is to have the agents update their distributions serially 
(one after the other) rather than in parallel. See (Wolpert & Bieniawski 2004a) for a description of 
various kinds of serial schemes, as well as a discussion of partial serial, partial parallel algorithms. 

3.2 Nearest-Newton Updating 

To evaluate the gradient one only needs to evaluate or estimate the terms E t (G\xi) for all agents 
(see below and (Wolpert 2003a, Wolpert 2004a)). So gradient descent is typically straight-forward. 
It is also usually simple to evaluate the Hessian of our Lagrangian. However conventional Newton’s 
descent is often intractable for large systems, since that Hessian is so big that inverting often it 
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isn’t feasible. 


Of course there are schemes like conjugate gradient or quasi-Newton that can exploit second 
order information even when the Hessian cannot be inverted. However the special structure of the 
Lagrangian also allows second order information to be used for a simple variant of Newton descent. 
The associated update rule is called Nearest- Newton updating (Wolpert & Bieniawski 20046); we 
review it here. 

To derive Nearest-Newton we begin by considering the Lagrangian E 7r (G) — TS(ir), for an 
unrestricted probability distribution 7r. 8 This Lagrangian is a convex function of n with a diagonal 
Hessian. So given a current distribution 7r f we can make an unrestricted Newton step of this 
Lagrangian to a new distribution 7r i+1 . That new distribution typically is not in Q, even if the 
starting distribution is. However we can solve for the q t+l G Q that is nearest to 7r f+1 , for example 
by finding the q t+l G Q that minimizes qp KL distance D(p\\q) to that new point. 

More precisely, the Hessian of E^G) — TS(n), <9 2 £/<97r(x)cbr(x'), is diagonal, and so is simply 
inverted. This gives the Newton update for n 1 : 


TT 


(x) = 7 r*(x) - aX(x) 


G(x) - E w t(G) 


+ S( TT 1 ) + In 7T*(x) 


which is normalized if n 1 is normalized and where a * is a step size. As 7 d will typically not belong 
to Q we find the product distribution nearest to 7r i+1 by minimizing the KL distance D(ir t+1 \\q) 
with respect to q. The result is that qi(xi) = 7r * +1 (xj), i.e. g* is the marginal of n t+1 given by 
integrating it over x_j. 

Thus, whenever 7r* itself is a product distribution, the update rule for qi(xi) is 


^ +1 


(xi) = q\{xi) - alqfixi) 


E gL .(G|xi) - E qt (G) 


S(qi)+ In qj(xi 


(9) 


This update maintains the normalization of g*, but may make one or more q\ +1 (xi) greater than 
1 or less than 0. In such cases we set g,- +1 to be valid product distribution nearest in Euclidean 
distance (rather than KL distance) to the suggested Newton update. 
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3.3 Updating Lagrange Multipliers 

In order to satisfy the imposed optimization constraints {c a (x)} we must also update the Lagrange 
multipliers. To minimize communication between agents this is done in the simplest possible way 
- by gradient descent. Taking the partial derivatives with respect to X a gives the update rule 

A* +1 = A*+aiE g *(c a (x)) (10) 

where a\ is a step size and q\ is the local minimizer of C determined as above at the old settings, 
A 4 , of the multipliers. 

3.4 Other descent schemes 

It should be emphasized that PC encompasses many approaches to optimization of the Lagrangian 
that differ from those used here. For example, in (Bieniawski & Wolpert 2004a, Wolpert 2004d) 
there is discussion of alternative types of descent algorithms that are related to block relaxation, 
as well as to the fictitious play algorithm of game theory (Fudenberg & Tirole 1991, Shamma & 
Arslan 2004) and multi-agent reinforcement learning algorithms like those in collective intelligence 
(Wolpert & Turner 2001, Wolpert et al. 2003). 

As another example, see (Wolpert & Bieniawski 20046, Wolpert 2004a) for discussions of using 
pq KL distance (i.e., D(p\\q)) rather than qp distance. Interestingly, as discussed below, that alter- 
native distance must be used even for descent of qp distance, if one wishes to use 2nd order descent 
schemes. (Wolpert 20046, Wolpert 2004c) discusses using non-Boltzmann target distributions p, 
and many other options for what functional(s) to descend. 

4 Statistical estimation to update q 

Using either of the update rules Eqs. (8) or (9) requires knowing E i (GlxA, defined in Eq. (3.1). 
However, often we cannot efficiently calculate all the terms K q t (G\xi). To use our update rules in 
such situations we can use Monte Carlo sampling, as described in this section. 
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4.1 Monte Carlo sampling 


In the Monte Carlo approach, at each timestep every agent i samples its distribution qi to get a 
point x'j. Since we have a product distribution q, these samples provides us with a sample x of the 
full joint distribution q. By repeating this process L times we get a “block” of such joint samples. 
The G values in that block can be used by each agent separately to estimate its updating values 

(G|xj), for example simply by uniform averaging of the G values in the samples associated with 
each X{. Note that the single set of samples can be used no matter how many agents are in the 
system; we don’t need a different Monte Carlo process for each agent to estimate their separate 
E qt JG\ Xi ). 

All agents (variables) sample moves (variable settings) independently, and coupling occurs only 
in the updates of the q % . As we have seen this update (even to second order) for agent i depends 
only on the conditional expectations E q _ i (G\xi) where describes the strategies used by the 
other agents. Thus, if we are using Monte Carlo, then the only information which needs to be 
communicated to each agent is the G values upon which the estimate will be based. Using these 
values each agent independently updates its strategy (its qi) in a way which collectively is guaranteed 
to lower the Lagrangian. 

If the expectation is evaluated analytically, the ith agent needs the qj distributions for each of 
the j agents involved in factors in G that also involve i. For objective functions which consists of 
a sum of local interactions each of which individually involves only a small subset of the variables 
(e.g. the problems considered here), the number of agents that i needs to communicate with may 
be much smaller than n. 

4.2 Difference utilities for faster Monte Carlo convergence 

The basic Monte Carlo approach outlined above can be slow to converge in high- dimensional prob- 
lems. For the problems considered in this paper this is irrelevant, since (G\xi) may be efficiently 
calculated in closed form for all agents i and their moves Xi, so we don’t need to use Monte Carlo 
sampling. For completeness though here we review a variant of the basic Monte Carlo approach 
that converges far more quickly. See (Wolpert 2003a, Wolpert & Bieniawski 20046) for details. 

Say we are at a timestep t at the end of a Monte Carlo block, and consider the simplest updating 
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rule. This is gradient descent updating, in which we wish to update g* at a timestep t by having 
each agent i take a small step in the direction (cf. Eq. (5)) 


£2,Ct _A 


dC(q t ) <9£(g*) 

_d qi (x})' ’ d qi (x l * il ). 


Vi(q ) 1 


where rji(q ) was defined in Eq. (7), 1 is the vector of length \Xj\ all of whose components are 1, 
and x \ , • • • ,x\ 1 are the \Xi\ moves available to agent i. In general, there will be some error in 
i's estimate of that step, since it has limited information about q t _ i . Presuming quadratic loss 
reflects quality of the update, for agent i the Bayes-optimal estimate of its update is the posterior 
expectation 

f dq t P(q-i | rii) f i,G 

where rii is all the prior knowledge and data that i has, and the dependence of f i,G on qP is 
implicit. 9 P(q t _ i \rii) is a probability distribution over likely values of qP given the information n* 
available to agent i. 

Now agent i can evaluate In qi{x{) for each of its moves Xi exactly. However to perform its 
update it still needs the integrals 


J dq t Piqtiln^Egt^Glxi) 

(recall Eq. (6)). In general these integrals can be very difficult to evaluate. As an alternative, we 
can replace those integrals with simple maximum likelihood estimators of them, i.e., we can use 
Monte Carlo sampling. In this case, the prior information, m, available to the agent is a list, £, of 
L joint configurations x along with their accompanying objective values G(x). 

To define this precisely, for any function /r(x), let h(rq) be a vector of length \Xi\ which is 
indexed by Xi. The Xi component of h(nj) is indicated as h Xi (m). Each of its components is given 
by the information in rii. The Xj’th such component is the empirical average of the values that h 
had in the L Xi samples from the just-completed Monte Carlo block when agent i made move x^, 
i.e. 

h Xi (m) = -jr- *52 h(x) 

Xi xfEfc, 


17 



where £ Xi is the set of x in £ whose ith component is equal to x t , and where L Xi = £ Xi \ . Given 
this notation, we can express the components of the gradient update step for agent i under the 
simple maximum likelihood estimator as the values 

fxf( n i) = -{G Xi {ni) +T\nqi{xi)} - fj(rii ) ( 11 ) 

where 

mint) - r^^{G x '.(ni) +T\nq i (x' i )}. (12) 

1 ll A 

Unfortunately, often in very large systems the convergence of G(rii ) with growing L is very 
slow, since the distribution sampled by the Monte Carlo process to produce rq is very broad. This 
suggests we use some alternative estimator. Here we focus on estimators that are still maximum 
likelihood, just with a different choice of utility. To that end, first posit that the differences 
E q t {G\xi) — K q t (G one for each (x*, x'-) pair, are unchanged when one replaces G with some 
other function g t . So the change is equivalent to adding a constant to G , as far as those differences 
are concerned. This means that if agent i used qG to evaluate its expectation values exactly, then 
its associated update would be unchanged under this replacement. (This is due to cancellation 
of the equivalent additive constant with the change that arises in i]i(rii) under the replacement of 
G(x) with g 1 (x)). It is straight-forward to verify that the set of all g l guaranteed to have this 
character, regardless of the form of q. is the set of difference utilities, g l (x) = G(x) — ZT(x_i) for 
some function D 1 . G itself is the trivial case T)*(x_j) = 0 Vcc*. 

On the other hand, if we use a difference utility rather than G in our maximum likelihood 
estimator then it is the sample values of P(g 2 ) that generate rq, and we use the associated x,- 
indexed vector g Xi {ni ) rather than G Xi (rii) to update each q t . For well-chosen D l it may typically 
be the case that g l (ni) has a far smaller standard deviation than does G(nf). In particular, if 
the number of coordinates coupled to i through G does not grow as the system does, often such 
difference utility error bars will not grow much with system size, whereas the error bars associated 
with G will grow greatly. Another advantage of difference utilities is that very often the Monte 
Carlo values of a difference utility are far easier to evaluate than are those of G, due to cancellation 
in subtracting D l . 
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To make this more precise we can solve for the difference utility with minimal error bar. First 
as a notational matter, extend the definition of f i,& by replacing G with (arbitrary) h throughout, 
writing that extended version as f t,h . Then assuming no L Xi = 0, we are interested in the g l 
minimizing the data-averaged quadratic error, 


E 


q-i,n a i 


(Err) = / dq-i P(q-i) / dnf P(nf |nf , q_ u g l )\\P^ -f l ’ 9 \rii) || s 


(13) 


where P(q~i ) reflects any prior information we might have concerning q-i (e.g., that it is likely that 

' X . rti 

the current f*’ 3 is close to that estimated for the previous block of L steps), and nf is the set of 
values of the private utility contained in rq. (The associated x, L values, nf, are independent of g l 
and q-i and therefore for our purposes can be treated as though they are fixed.) 

Now the components of f l ’ 9 ‘ (nt) (one for each Xi) are not independent in general, being coupled 
via fji(ni). To get an integrand that involves only independent variables, we must work with only 
one Xi component at a time. To that end, rewrite the data-averaged quadratic error as 


Y J dq-i P(q-i ) 

Xj, 


dnf P{n(\n*,q-i,g l )[f l x f - f x f (m)f 


where fxf is the qpXi) component of P' 9 * . Our results will hold for all g_j, so we ignore the outer 
integral and focus on 


E 


dnf 


P(n(\n*,q-i,gn[fif-Pf(ni)f 


(14) 


For any x* the inner integral can be decomposed with the famous bias-variance decomposition into 
a sum of two terms (Duda, Hart Sz Stork 2000). 10 The first of the two terms in our sum is the 
(square of the) bias , fxf — E gi(f£f ), where 


^(/zf ( n f)) = / dnf P(n(\nf, q-i, g l )f x f(ni 


(15) 


is the expectation (over all possible sets of Monte Carlo sample utility values nf) of f x f (rii). The 
bias reflects the systematic trend of our sample-based estimate of f x f to differ from the actual 
fxf ■ When bias is zero, we know that on average our estimator will return the actual value it’s 
estimating. 
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The second term in our sum is the variance, 


Var(/*f(nf)) = / dnf P(nf \nf , q- t , g* ) {f x f (rq) - E < (/*f )}‘ 


(16) 


In general variance reflects how much the value of our estimate “bounces around” if one were to 
resample our Monte Carlo block. In our context, it reflects how much the private utility of agent 
i depends on its own move X{ versus the moves of the other agents. When V s estimator is isolated 
from the moves of the other agents fxf (rq) is mostly independent of the moves of the other agents, 
and therefore of rq. This means that variance is low, and there is a crisp “signal-to- noise” guiding 
V s updates. In this situation the agent can achieve a preset accuracy level in its updating with a 
minimal total number of samples in the Monte Carlo block. 

Plug the general form for a difference utility into the formula for fxf (rif) to see that (due to 
cancellation with the fj(rii ) term) its nf -averaged value is independent of D'. Accordingly bias must 
equal 0 for difference utilities. (In fact, difference utilities are the only utility that is guaranteed to 
have zero bias for all q~i-) So our expected error reduces to the sum over all Xi of the variance for 
each x t . 

For each one of those variances again use Eq. 11 with G replaced by g l throughout to expand 
fxf (rq). Since the q-jfxi) terms in that expansion are all the same constant independent of rq, they 
don’t contribute to the variance. Accordingly we have 


Var(/’f(nf)) = Var 


UO - 


Z_j Vx. 


9 X '. \ n i 


= Var 


1 - 


1 


9 l xM) 

1 l| A+xi 


(17) 


Since nf is fixed and we are doing IID sampling, the two expressions inside the variance function 
are statistically independent. In addition, the variance of a difference of independent variables is 
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the sum of the variances. Accordingly, the sum over all Xi of our variances is (cf. Eq. (14)) 


5^Var(/*f (nf)) 

Xi 



Xi 1 — 1 
1^1 




(18) 


where the third equation follows from the second by using the trivial identity XX XX^a F (P) = 
XX F ( a ) XX^a 1 for an y function F. 

Since for each such x* we are doing L^-fold IID sampling of an associated fixed distribution, 
the variance for each separate Xi is of the form 


J dy P( y) 


— ^ yj -E P [ — 


Y L xi 


3 = 1 


Xi 



1 2 


for a fixed distribution P(yi, y 2 , ■ ■ ■ , y Lr ) = Y^j=\ F {Vj)- We can again use the decomposition of a 
variance of a sum into a sum of variances to evaluate this. With the distribution q t implicit, define 
the single sample variance for the value of any function H(x), for move x t , as 


Var (H( Xi )) = E([ H] 2 \ Xi ) - [^(i^la;,)] 2 - 


(19) 


This gives 


Var($..«)) = Var (g\ Xi )) / L x 


(20) 


Collecting terms, we get 


5Z Var (^f*K)) 

Xi 


Xj\ - 1 Var (g l ( X j)) 

\Xi\ 4- L 


( 21 ) 


Now Var(A(r)) = (1/2) XX j t 2 F (ti)P(t 2 )[A(ti) — Afo)} 2 for any random variable r with dis- 
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tribution P(t). Use this to rewrite the sum in Eq. (21) as 


1 ^ 1 -1 

2\Xi\ 


E 

Xi x'_ . ,x" , 


0*(®i,x'U)] 


2 


Bring the sum over a?j inside the other integrals, expand g l , and drop the overall multiplicative 
constant to get 


Var(FMnf)) oc E 

— i — z 

y- [g(aj,xU) - G(xj,xU) - {PtxU) - ^(x'U )}] 2 

rf. . ^Z 

•A'Z 

For each xU and x(U, our choice of D l minimizes the sum so long as the difference in the curly 
brackets obeys 


z^(xU) - sV-i) = E [G(a:i,x/ ~ i) r G(a:i,x " i)] / V j- 




U 7 




This can be assured by picking 


^(x-i) = 


E — 


n -i 


E 


Gjx'nx-j) 

L„i 


( 22 ) 


for all x_j. The associated difference utility, g*(x) = G(x) — Z?*(x_j), is called the Aristocrat utility 
(AU). An approximation to it was investigated in (Wolpert &; Turner 2001, Wolpert, Turner & 
Bandari 2002, Wolpert & Turner 2002) and references therein. AU itself was derived in (Wolpert 
20036). 

Note that AU minimizes variance of gradient descent updating regardless of the form of q. 
Indeed, being independent of q_ t , it minimizes our original q_ i integral in Eq. (13), regardless of 
the prior P(q_ i ). For the same reason it is optimal if the integral is replaced by a worst-case bound 
over q_ i . 

Sometimes not all the terms in the sum in AU can be stored, because \Xf\ and/or the block size 
is too large. In such a case nf must be averaged over as well as nf. That sum can be approximated 
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by replacing the L Xi values in the definition of AU with q(xi)L. This replacement also provides a 
way to address cases where one or more L Xi = 0. 11 Similarly, for computational reasons it may be 
desirable to approximate the weighted average of G over all x \ which defines AU. 

The sum over x\ occurring in AU should not be confused with the sum over x\ that pulls our 
gradient estimate back into the unit simplex. The sum here is over values of G for counterfactuals 
sample pairs (x[, x_j). (The other sum is over values of our gradient estimate at all of its arguments.) 
When the functional form of G is known it is often the case that there is cancellation which allows 
AU be calculated directly, in one evaluation, an evaluation which can be cheaper than that of G(x). 
When this is not the case their evaluation incurs a computational cost in general. 12 This cost is 
offset by the fact that those evaluations allow us to determine the value of AU not just for the 
actual point (xj,x_j), but in fact for all points {(a^,x_j) \x\ € A)}. 

Nonetheless, there will be cases where evaluating AU requires evaluating all possible G(x',x_j), 
and where the cost of that is prohibitive, even if it allows us to update AU for all x t at once. 
Fortunately there are difference utilities that are cheaper to evaluate than AU while still having 
less variance than G. In particular, note that the weighting factor A - , 1 / XX" hr the formula for 
AU is largest for those Xi which occur infrequently, i.e. that have low qiixf). This observation leads 
to the Wonderful Life Utility (WLU), which is an approximation to AU that (being a difference 
utility) also has zero bias: 


gY LU {xi, x_j) = G(xi, x_j)) - G(xf amp , x_*). (23) 

In this formula, x Plamp = arg min x , L Xi or if we wish to be more conservative, arg min x . qi(xi ), 
agent i’s lowest probability move (Wolpert 2003a, Wolpert 2004a). 13 

4.3 Discussion of Monte Carlo sampling 

Note that the foregoing analysis breaks down if any of the L Xi = 0. More generally it may break 
down if just one or more of the q{xf) are particularly small in comparison to the others, even if no L Xi 
is exactly zero. The reason for this is that our approximation of the average over nf with the average 
where no L Xi = 0 breaks down. Doing the exact calculation with no such approximation doesn’t 
fix the situation — once we have to assign non-infinitesimal probability to L Xi = 0, we’re allowing 
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a situation in which the gradient step would take us off the simplex of allowed q € Q- We might 
try to compensate for this by reducing the stepsize, but in general the foregoing analysis doesn’t 
hold if stepsize is reduced in some situations but not in any others. (Variable stepsize constitutes 
a change to the update rule. Such a modification to the update rule must be incorporated into the 
analysis — which obviates the derivation of AU.) 

One way to address this scenario would be to simply zero out the probability of agent i making 
any move x'j for which qi(xi) is particular small. In other words, we can redefine i’s move space to 
exclude any moves if their probability ever gets sufficiently small. This has the additional advantage 
of reducing the amount of “noise” that agents j 7^ i will see in the next Monte Carlo block, since 
the effect of agent i on the value of G in that block is more tightly constrained. 

There several ways to extend the derivation of AU, which only addresses estimation error for 
a single agent at a time, and for just that agent’s current update. One such extension is to have 
agent i’s utility set to improve the accuracy of the update estimation for agents j 7^ i. For example, 
we could try to bias qi to be peaked about only a few moves, thereby reducing the amount of noise 
those other agents j 7^ i will see in the next Monte Carlo block due to variability in i’s move choice. 
Another extension is to have agent i’s utility set to improve the accuracy of its estimate of its 
update for future Monte Carlo blocks, even at the expense of accuracy for the current block.. 

Strictly speaking, the derivation of AU only applies to gradient descent updating of q. Difference 
utilities are unbiased estimators for the Nearest Newton update rule, so long as each i estimates 
E q t(g l ) as q\{xi ) times the estimate of E q t(g l \xi), 


^9xMi)4{xi) 

Xi 

rather than as the empirical average over all samples of g \ 14 

Xi 

However the calculation for how to minimize the variance must be modified. Redoing the algebra 
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above, the analog of AU for the Nearest Newton rule arises if we replace 


J— - { [1 - afe)] 2 + V [«(x')] 2 } (24) 

U ‘ L *‘ 1 *'#*, 1 

throughout the equation defining AU. Similar considerations apply to Brouwer updating as well. 
Nonetheless, in practice AU and WLU as defined above work well (and in particular far better than 
taking g l = G ) for the other updating rules as well. 

For gradient descent updating, minimizing expected quadratic error of our estimator of E t (G\xi) 
corresponds to making a quadratic approximation to the Lagrangian surface, and then minimizing 
the expected value of the Lagrangian after the gradient step (Wolpert 2003a). More generally, 
and especially for other update rules, some other kind of error measure might be preferable. Such 
measures would differ from the bias-variance decomposition. We do not consider such alternatives 
here. 

Note that the agents are completely “blind” in the Monte Carlo process outline above, getting 
no information from other agents other than the values of G'(x). When we allow some information 
to be transmitted between the agents we can improve the estimation of E g * ( G\x { ) beyond that 
of the simple Monte Carlo process outlined above. For example, say that at every timestep the 
agent i knows not just its own move Xj, but in fact the joint move x. Then as time goes on it 
accumulates a training set of pairs {(x, G'(x))}. These can be used with conventional supervised 
learning algorithms (Duda et al. 2000) to form a rough estimate of the entire function G , G. Say 
that in addition i knows not its own distribution c/,;(x'(), but in fact the entire joint distribution, 
^(x*). Then it can use that joint distribution together with G to form an estimate of E^t (G\xi). 
That estimate is in addition to the one formed by the blind Monte Carlo process outlined above. 
One can then combine these estimates to form one superior to both. See (Lee & Wolpert 2004). 

Even when we are restricted to a blind Monte Carlo process, there are many heuristics that 
when incorporated into the update rules that can greatly improve their performance on real-world 
problems (Wolpert 2004c). In this paper we examine problems for which joint distributions q are 
known to all agents as well as the function form of G(x) and the required expectations E^t [G\xi) 
may be obtained in closed form. So there is no need for Monte Carlo approximations. Accordingly 
there is no need for those heuristics, and there is not even any need to using difference utilities. 
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Empirical investigations of the effects of using difference utility functions and the heuristics may 
be found in (Bieniawski Sz Wolpert 2004a, Bieniawski et al. 2004, Bieniawski & Wolpert 20046). 

5 Semicoordinate Transformations 

5.1 Motivation 

Consider a multi-stage game like chess, with the stages (i.e., the instants at which one of the players 
makes a move) delineated by t. In game theoretic terms, the “strategy” of a player is the mapping 
from board-configuration to response that specifies the rule it adopts before play starts (Fudenberg 

6 Tirole 1991, Basar & Olsder 1999, Osborne & Rubenstein 1994, Aumann Sz Hart 1992, Fudenberg 
& Levine 1998). More generally, in a multi-stage game like chess the strategy of player i, Xi, is the 
set of t-indexed maps taking what that player has observed in the stages t' < t into its move at 
stage t. Formally, this set of maps is called player V s normal form strategy. 

The joint strategy of the two players in chess sets their joint move-sequence, though in gen- 
eral the reverse need not be true. In addition, one can always find a joint strategy to result in 
any particular joint move-sequence. Now typically at any stage there is overlap in what the play- 
ers have observed over the preceding stages. This means that even if the players’ strategies are 
statistically independent (being separately set before play started), their move sequences are statis- 
tically coupled. In such a situation, by parameterizing the space Z of joint-move-sequences z with 
joint-strategies x, we shift our focus from the coupled distribution P( z) to the decoupled product 
distribution, q(x). This is the advantage of casting multi-stage games in terms of normal form 
strategies. 

More generally, given any two spaces X and Z , any associated onto mapping ( : Z — > X, 
not necessarily invertible, is called a semicoordinate system. The identity mapping Z — > Z is 
a trivial example of a semicoordinate system. Another semicoordinate system is the mapping 
from joint-strategies in a multi-stage game to joint move-sequences. In other words, changing the 
representation space of a multi-stage game from move-sequences z to strategies x is a semicoordinate 
transformation of that game. 

Intuitively, a semi-coordinate transformation is a reparameterization of how a game — a map- 
ping from joint moves to associated payoffs — is represented. So we can perform a semicoordinate 
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transformation even in a single-stage game. Say we restrict attention to distributions over X that 
are product distributions. Then changing £(•) from the identity map to some other function means 
that the players’ moves are no longer independent. After the transformation their move choices 
the components of z — are statistically coupled, even though we are considering a product 
distribution. 

Formally, this is expressed via the standard rule for transforming probabilities, 

Pz{ z € Z) = £ (Px) = J dx P x (x)5(z - £(x)), (25) 

where Px and Pz are the distributions across X and Z , respectively. To see what this rule means 
geometrically, recall that V is the space of all distributions (product or otherwise) over Z and that 
Q is the space of all product distributions over X. Let £(Q) be the image of Q in V. Then by 
changing £(•), we change that image; different choices of £(•) will result in different manifolds £(Q). 

As an example, say we have two players, with two possible moves each. So z consists of the 
possible joint moves, labelled (0, 0), (0, 1), (1, 0) and (1,1). Have X = Z, and choose £(0,0) = 
(0,0), C(0, 1) = (1,1), C(1,0) = (1,0), and £(1,1) = (0,1). Say that q is given by qi(xi = 
0) = </2 (^2 = 0) = 2/3. Then the distribution over joint-moves z is Pz(0,0) = P x (0,0) = 4/9, 
Pz(l,0) = Pz{ 1, 1) = 2/9, Pz{ 0, 1) = 1/9. So P z (z) / Pz(z\)Pz(z 2 )', the moves of the players are 
statistically coupled, even though their strategies Xi are independent. 

Any Pz, no matter what the coupling among its components, can be expressed as £ (Px) for 
some product distribution Px for and associated £(•) In the worst case, one can simply choose X to 
have a single component, with £(•) a bijection between that component and the vector z — trivially, 
any distribution over such an X is a product distribution. Another simple example is where one 
aggregates one or more agents into a new single agent, i.e. , replaces the product distribution over 
the joint moves of those agents with an arbitrary distribution over their joint moves. This is related 
to the concept coalitions in cooperative game theory, as well as to Aumann’s correlated equilibrium 
(Fudenberg & Tirole 1991, Aumann 1987, Aumann & Hart 1992). 

Less trivially, given any model class of distributions {Pz}, there is an X and associated £(•) such 
that {Pz} is identical to £(Q_y). Formally this is expressed in a result concerning Bayes nets. For 
simplicity, restrict attention to finite Z. Order the components of Z from 1 to N . For each index 
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i E {1, 2 ,N}, have the parent function Pa(?', z) fix a subset of the components of z with index 
greater than i, returning the value of those components for the z in its second argument if that 
subset of components is non-empty. So for example, with N > 5, we could have Pa(l, z) = (z 2 , 25 ). 
Another possibility is that Pa(l,z) is the empty set, independent of z. 

Let A(Pa) be the set of all probability distributions Pz that obey the conditional dependencies 
implied by Pa: V Pz E A(Pa),z E Z , 

N 

p z (z) = X\Pz{ Zi \V(i,z)). (26) 

i— 1 

By definition, if Pa(i,z)) is empty, Pz(zi\ Pa(i, z)) is just the Pth marginal of Pz, Pz(zi)- As 
an example of these definitions, the dependencies {Pa(l,z) = (z 2 , z 3 ), Pa(2, z) = Z 4 ,Pa( 3 ,z) = 
(),Pa(4, z) = ()} correspond to the family of distributions factoring as 

P( Z) = P(zi\z 2 , Z 3 )P(Z2\Z4)P(Z 3 )P(Z4) 

As proven in (Wolpert & Bieniawski 2004a), for any choice of Pa there is an associated set of 
distributions ((Qx) that equals A(Pa) exactly: 

Proposition: Define the components of X using multiple indices: For all i E {1,2,..., iV} and 
possible associated values (as one varies over z E Z) of the vector Pa(i,z), there is a separate 
component of x, ®j ; p a (j )Z ). This component can take on any of the values that z* can. Define £(•) 
recursively, starting at i = N and working to lower i, by the following rule: V i E {1, 2, . . . , N}, 

[C (■*■)]* ■ 


Then A(Pa) = C (Qx). 

Intuitively, each component of x in Prop. 1 is the conditional distribution Pz(zi\ Pa(i, z)) 
for some particular instance of the vector Pa(pz). As illustration consider again the example 
{Pa(l,z) = (z 2 , z 3 ), Pa(2, z) = Z 4 ,Pa( 3 ,z) = (),Pa(4, z) = ()}. If each Zi assumes the value 0 or 
1 , then x has 8 components X 4 ,x 3 ,X 2 -o,X 2 -i,xi ] qo,xi-oi,xi-io, and x\-n with each component also 
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x\ — 0 

X\ = 1 

(0,0,0), (0,0,1), (0,1,0), (0,1,1) 
(1,0,0), (1,0,1), (1,1,0), (1,1,1) 

z\ = 0 
21 = 1 

(1,0,0), (0,1,0), (0,0,1), (1,1,1) 
(0,0,0), (1,1,0), (1,0,1), (0,1,1) 

to to 

II II 
0 

(0,0,0), (0,0,1), (1,0,0), (1,0,1) 
(0,1,0), (0,1,1), (1,1,0), (1,1,1) 

22 = 0 
22 = 1 

(1,0,0), (0,1,0), (1,0,1), (0,1,1) 
(0,0,0), (1,1,0), (0,0,1), (1,1,1) 

CO CO 

II II 
0 

(0,0,0), (0,1,0), (1,0,0), (1,1,0) 
(0,0,1), (0,1,1), (1,0,1), (1,1,1) 

23 = 0 

23 = 1 

(1,0,0), (0,1,0), (1,0,1), (0,1,1) 
(0,0,0), (1,1,0), (0,0,1), (1,1,1) 


Table 1: Resultant partitions from the transformation of Figure 1(b). 
either 0 or 1. The product distribution in X is 

<?(x) = g'4(*4)93(^3)Q , 2;0(-'C2;0)Q , 2;l(.'C2;l)g , l;00(*l;00)g , l;0l(®l;0l)gi;10(-'Cl;10)9l;ll(a:i;ll)- 

Under ( the distribution ^4(0:4) is mapped to (74(2:4), (/2;o(^2 ; o) is mapped to 42 (22 (24 = 0), (7i ; oi(*i;Oi) 
is mapped to qi(z\ \z^ = 0, Z 3 = 1), and so on. 

Prop. 1 means that in principle we never need consider coupled distributions. It suffices to 
restrict attention to product distributions, so long as we use an appropriate semicoordinate system. 
As we shall see, mixture models over Z can be also be represented using products. However, before 
discussing mixture models we show how transformation of semicoordinate systems can in principle 
be used to escape local minima in L(q). 

5.2 Semicoordinate transformations and local minima 

To illustrate another application of semicoordinate transformations, we confine ourselves to the 
case where X = Z so that C, is a bijection on X. 

We assume that the domain of the zth of n variables has size | A) | . Then | | = n" =1 \Xj is the 

size of the search space. Each coordinate variable Xi partitions the search space into \X t \ disjoint 
regions. The partitions are such that the intersection over all variable coordinates yields a single x. 
In particular, the standard semicoordinate system relies on the partition [*, • • • , *, Xi = 0, *, ■ ■ ■ , *], 
• Xi = \Xi\ — 1, *, • • • , *] for each coordinate x*. 

As a illustrative example, consider 3 binary variables where X = {0, l} 3 . Figure 1(a) shows 
the 8 points in the search space represented in the standard coordinate system. Figure 1(b) shows 
a shuffling of the 8 configurations under the permutation (01234567) -4 (15264073). The 
resulting partitions of configurations are given in Table 1. 
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X\ 



(a) 


zi 



(b) 


Figure 1: (a) Original linear indexing for 3 binary variables xi,X 2 ,X 3 - (b) Result after applying 
the transformation to the new variables zi , Z 2 , Z 3 . 

Such transformations can be used to escape from local minima of the Lagrangian. To see this 
consider a coordinate transformation ( from X to the new space Z such that z = £(x), and choose 
q( z) = g(x) (i.e. do not change the associated probabilities). Then the entropy contribution to the 
Lagrangian remains unchanged, but the expected G alters from )T) X G(x)g(x) to 15 

G x(x)q(x ) - J2G(((x.))q{x.) = ^G(C(z))g(z). 

XX Z 

This means that the gradient of the rnaxent Lagrangian will typically differ before and after the ap- 
plication of £. In particular, what was a local minimum with zero gradient before the semicoordinate 
transformation may not be a local minimum after the transformation and the resultant shuffling 
of utility values. As difficult problems typically have many local minima in their Lagrangian, such 
semicoordinate transformations may prove very useful. 

A simple example is shown in 2(a) where a Lagrangian surface for 2 binary variables is shown. 
The utility values are G(0, 0) = 0, G( 1, 0) = 25, (7(0, 1) = 18, G{ 1, 1) = 2. If the temperature is 7 in 
units of the objective then the global minimum is at gi(0) = 0.95, 52 ( 0 ) = 0.91 where L = —0.82. 
At this temperature there is a suboptimal local minimum (indicated by the dot in the lower left) 
located at gj'(O) = 0.14, g^O) = 0.08 where L = 0.83. 

There are a number of criteria that might be used to determine a semicoordinate transforma- 
tion to escape from this local minimum q* . Two simple choices are to select the transformation 
that minimizes the new value of the rnaxent Lagrangian (i.e., minimize L(q*)), or to select the 
transformation which results in the largest gradient of the rnaxent Lagrangian at q*, (i,e, , maxi- 
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L = 0.83, grad L = [0.00 -0.00] 



L = -1 .66, grad L = [-14.13 -0.10] 



L = 15.49, grad L = [-34.65 -34.68] 



Figure 2: (a) Original Lagrangian function. The suboptimal local minimum is located at q* which 
is indicated with a dot in the lower left corner, (b) The Lagrangian under the coordinate trans- 
formation which minimizes L(q*). (c) The Lagrangian under the coordinate transformation which 
maximizes the norm of the gradient at q*. The direction of the negative gradient is indicated by a 
black arrow. 
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mize ||V,L(g*)||). For this simple problem the results of both these choices are shown as Figures 
2(b) and 2(c) respectively. The transformation in each of these cases can be determined from the 
shuffling of G values. 

5.3 Semicoordinate Transformations for Mixture Distributions 

We have described how the Lagrangian measuring the distance of a product (mean field) distribution 
to a Boltzmann distribution may be minimized in a distributed fashion. We now extend these results 
to mixtures of product distributions, in order to represent multiple product distribution solutions at 
once. As mentioned above, we can always do that by means of a semicoordinate transformation of 
the underlying variables, allowing us to express that mixture distribution as the image of a product 
distribution over a different space. In this section we demonstrate this explicitly. 

Let x £ X indicate the new set of variables in a space of dimension dx, where z £ Z is the 
original (pre-transformation) space over which G is defined. Then a product distribution over X 
(where dx > n, the dimension of Z), and an appropriately chosen mapping £ : X — > Z induces a 
mixture distribution over Z. 

To see this consider an M component mixture distribution over n variables, which we write as: 
q{z) = Ylm=i < 7 °(?™)< 7 m (z) with J2m= iQ°( m ) = 1 an d q m {z) = IIILi qT^)- We can express this 
q{ z) as (the image of) a product distribution over a space X of dimension dx = 1 + Mn. Intuitively, 
the first dimension of X (indicated as x° € [1 , Mj) labels the mixture, and the remaining A'In 
dimensions (indicated as x™ & Z t ) correspond to each of the original n dimensions for each of the 
M mixtures. 

More precisely, write out the T-space product distribution as qx(x) = q°(x°) rim=i 
with g m (x m ) = l\h f° r x = It 0 ) x1 ) ' ' ‘ > xM ] an d x m = [x™, • • • The density in X 

and Z are related as usual by q( z) = ]F X qx(x)S(z— C( x )) for our vector-valued mapping £ : X — > Z, 
with the delta function of vectors being understood component-wise. If we label the components 
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of C so that Zi = (i(x°, x 1 , • • • , x M ) = xf° we find 


g(z) = ^ g °(x°) ]r 

x° x 1 ,-" ,x M m j 

=e«v> e 

a ; 0 x 1 ,— ,x M m i 

= e «v> e <fv°) n s (» - 4 °) 

rr° x l0 * 

= £ g 0 (sV°(z) 

a ; 0 

Thus, under £ the product distribution g^ is mapped to the mixture of products g(z) = ]T) m q°(m)q m ( z) 
(after relabelling x° to to), as desired. 

The maxent Lagrangian of the <T product distribution g^(x) is 


M 

C{q X ) = ]T q°(rn)E q m(G) - t[%°) + £ S(g m ) 

m=l 


This Lagrangian contains a term pushing us (as we search for the minimizer of that Lagrangian) to 
maximize the entropy of the mixture weights. However, it provides no incentive for the distributions 
q m to differ from each other. 

If we wish to have the q m differ from one another, we can instead consider the maxent Lagrangian 
over g(z). In this case 


£z(q) = Y G ^ q ^ ~~ TS “ TS (9) 

Z X 

= ^g°(TO)E <?m (G) -TsY^ g 0 (?B)g m (z)V 

m ' m ' 

The entropy term differs crucially in these two maxent Lagrangians. To see this add and subtract 
T Ylm g° (to) S (q m ) to the Z Lagrangian to find 

C z (q) = Y ( l 0 (m)jr(q m )- TJ <,q) (27) 

m 

where each C{q m ) is a maxent Lagrangian as given by Eq. (1), and J{q) > 0 is a modified version 
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of the Jensen-Shannon (JS) distance, 


J(q) = S(J2 q°(m)q m ) - E ?V)S(? m ) = - E E Q°(rn)q m (z) In 4^y- 

in in m z Q V^ 1 ) 

Conventional Jensen Shannon distance is defined to compare two distributions to each other, and 
gives those distributions equal weight (Lin 1991). In contrast, the generalized JS distance J(q) 
concerns multiple distributions, and weights them nonuniformly, according to q°(m). 

J(q) is maximized when the q m are all different from each other. Thus its inclusion in our 
Lagrangian pushes us to have the mixing components {q m (x) \m = 1, . . . , M} be far apart (in X) 
from one another. In this, we can view Eq. (27) as a novel derivation of (a generalized version of) 
Jensen Shannon distance. Unfortunately, it also couples all of the variables (because of the sum 
inside the logarithm), preventing a highly distributed solution. 

To address this, in this paper we replace J(q) in Cz{q) with another function which lower- 
bounds J{q) but which requires less communication between agents. It is this modified Lagrangian 
that we will minimize. 

5.4 A Variational Lagrangian 

Following (Jaakkola &; Jordan 1998), we introduce M variational functions w(z\m) and lower-bound 
the true JS distance with 


j(< i) 


-EE q°(m)q m (z) In 

m z 

= EE q°(m)q m ( z) lnw;( 


1 4Umi W(z l m > g(z) 

w(z\m) q°(m)q m ( z) 

z|m)) — E^9 0 ( m ) lng°(m) 


m z 


m 


-EE q°(m)q m {z) In 

m z 


w{z\m)q{z) 
q°(m)q m ( z) ' 
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Now replace M of the — In terms with the lower bound — In z > — uz + In v + 1 obtained from the 
Legendre dual of the logarithm to find 


J(q ) > J(q, w, u) = ^ ^ q°(m)q m (z) lnu>(z|m) — ^ q°{m) lng°(m) 

m z m 

- v m ^2 w(z\m)q(z) + ^ q°(m) In v m + 1. 

m z m 

Optimization over w and v maximizes this lower bound. To further aid in distributing the algorithm 
we restrict the class of variational w(z\m) to products: w(z\m) = fL Wi(zi\m). For this choice 


J(q, w, v) = J2 q°(m) \ B m ’ m - £ A m ^'u fn + In v m \ + S(q°) + 1 


(28) 


where A™ ,m = Yl Zi qr( z i) w i( z i\ m )> A m,m ~ Ilf=i A™’ m , B™ ,m = J2 Zi q™( z i) ln w i{ z i\ m ), and 
B m ’ rn A 16 At any temperature T the variational Lagrangian which must be mini- 

mized with respect to q. w and v (subject to appropriate positivity and normalization constraints) 
is then 

£z{q,w, u) = ^2q°(m)£(q m ) - TJ(q,w,u ) (29) 


with J(q,w,u) given by Eq. (28). 


6 Minimizing the Mixture Distribution Lagrangian 

Equating the gradients with respect to w and v to zero gives 

(30) 

(31) 

The dependence of Lz on q°(m) is particularly simple: Cz{q , w , v) « ^2 m q 0 (m)£(m)—T(S(q°)+ 1) 
up to ^-independent terms and where 

£(m) 4 E q m(G) - T ( S[q m ] + B m ’ m - ^ A m ^^ % + ln v m 



q°(rh)A fn ' m . 

q°(m) ^ 


Wi(zi\m ) oc 


q°(m)q^(zi 


E<I 0 (A)<jf(«) 


A " 


n -1 


j^m,m 
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Thus, the mixture weights are Boltzmann distributed with energy function £{m): 


exp (— £ (m)/T) 
Em ex p (— £{™)/ T )' 


(32) 


The determination of q™(zi) is similar. The relevant terms in Cz involving q™(zi) are C-z ~ 
q°(m) Y, Zi £m{zi)q™{zi) - TS(q ™) where 


( j^rnjrh 
\\\Wi(Zi\m) - ^2 I 7 ™) 

m A ’ 

As before the conditional expectation E ? ™. (G\zi) is G(zi, z_j)g™ (z_j). The mixture probabil- 
ities are thus determined as 


exp(-£’ m (z i )/T) 
E^exp (-£ m (zi)/T)' 


(33) 


6.1 Agent Communication 


These results require minimal communication between agents. An agent, call this the 0-agent, is 
assigned to manage the determination of q°(m), and (i,m)-agents manage the determination of 
q^izi). The M (i, m)-agents for a fixed i communicate their Wi(zi\m) to determine . These 
results along with the B" h " 1 from each (i, m) agent are then forwarded to the 0-agent who forms 
A m,m and B m,rn broadcasts this back to all (z, m)-agents. With these quantities and the local 
estimates for E^™. (G\zi), all qf 1 can be updated independently. 


7 Experiments 

In this section we demonstrate our methods on some simple problems. To keep this already lengthy 
paper from being too large, this section is meant to be illustrative only. The reader is directed to 
(Bieniawski & Wolpert 2004a, Bieniawski et al. 2004, Bieniawski & Wolpert 20046, Lee & Wolpert 
2004, Wolpert &; Lee 2004, Macready & Wolpert 2004a) for many other related experiments. 

As our first example we test the probability collective method on two different problems: a 
A;- sat constraint satisfaction problem having multiple feasible solutions, and optimization of an 
unconstrained optimization of an NK function. 
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(a) (b) 


Figure 3: (a) Evolution of Lagrangian value (solid line), expected constraint violation (dotted line), 
and constraint violations of most likely configuration (dashed line), (b) P(G ) after minimizing the 
Lagrangian for the first 3 multiplier settings. At termination P(G ) = S(G). 

7.1 k- sat 

The k- sat problem is perhaps the best studied CSP (Mezard, Parisi & Zecchina 2002). The goal 
is to assign N binary variables Zi values so that C clauses are satisfied. The ath clause involves 
k variables labelled by v a j £ [1, N] (for j € [1,/c]), and k binary values associated with each a 
and labelled by a a j. The ath clause is satisfied iff \/y= ii z v a ,j = a a,j] is true so we define the ath 
constraint as 

0 if Vj=l [Zv aJ =CTa,j\ 

C a ( z) = 

1 otherwise 

As the ath clause is violated only when all z Va j = a a ,j (with a = not a), the Lagrangian over 
product distributions can be written as C(q) = A T c (q) — TS(q ) where c (q) is the (7-vector of 
expected constraint violations whose ath component is c a (q) = c a (z)q(z) = fl ;=i Qv a j 
and A is the C vector of Lagrange multipliers. The only communication required to evaluate G 
and its conditional expectations is between agents appearing in the same clause. Typically, this 
communication network is sparse; for the N = 100, k = 3, C = 430 variable problem we consider 
each agent interacts with only 6 other agents on average. 

We first present results for a single product distribution. For any fixed setting of the Lagrange 
multipliers, the Lagrangian is minimized by iterating Eq. (9). Had the minimization been done by 
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Figure 4: Each constraint’s Lagrange multiplier versus the iterations when they change. 

the Brouwer method, any random subset of variables, no two of which appear in the same clause, 
could be updated simultaneously while still ensuring that the Lagrangian would decrease at each 
iteration. 

The minimization is terminated at a local minimum q*. If all constraints are satisfied at q* we 
return the solution z* = arg max z q*(z) otherwise the Lagrange multipliers are updated according 
to Eq. (10). In the present context, this updating rule offers a number of benefits. Firstly, 
those constraints which are violated most strongly have their penalty increased the most, and 
consequently, the agents involved in those constraints are most likely to alter their state. Secondly, 
the Lagrange multipliers contain a history of the constraint violations (since we keep adding to A) 
so that when the agents coordinate on their next move they are unlikely to return a previously 
violated state. This mimics the approach used in taboo search where revisiting of configurations 
is explicitly prevented, and aids in an efficient exploration of the search space. Lastly, rescaling 
the Lagrangian after each update of the multipliers by 1 T A = ^ a A a gives C(q) = A T c (q) — TS(q) 
where A = A/1 T A and T = T/1 t A. Since A a = 1 the first term reweights clauses according 
to their expected violation, while the temperature T cools in an automated way as the Lagrange 
multipliers increase. Cooling is most rapid when the expected constraint violation is large and 
slows as the optimum is approached. The parameters a\ thus govern the overall rate of cooling. 
We used the fixed value a\ = 0.5. 

Figure 3 presents results for a 100 variable k = 3 problem using a single mixture. The problem 
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is satisfiable formula uf 100-01. cnf from SATLIB (www.satlib.org). It was generated with the 
ratio of clauses to variables being near the phase transition, and consequently has few solutions. 
Fig. 3(a) shows the variation of the Lagrangian, the expected number of constraint violations, and 
the number of constraints violated in the most probable state z mp = arg max z q{ z) as a function of 
the number of iterations. The starting state is the maximum entropy configuration, and the starting 
temperature is T = 1.5 • 10 -3 . The iterations at which the Lagrange multipliers are updated are 
indicated by vertical dashed lines which are clearly visible as discontinuities in the Lagrangian 
values. To show the stochastic underpinnings of the algorithm we plot in Fig. 3(b) the probability 
density of the number of constraint violations obtained as P(G) = ]F z g(z )6(G — J2 a °a ( z )) - 1T 
Figure 4 shows the evolution of the renormalized Langrange multipliers A. At the first iteration the 
multiplier for all clauses are equal. As the algorithm progresses weight is shifted amongst difficult 
to satisfy clauses. 

Results on a larger problem with multiple mixtures are shown in Fig. 5(a). This is the 250 
variable/1065 clause problem uf 250-01 . cnf from SATLIB with the first 50 clauses removed so that 
the problem has multiple solutions. The optimization was performed by selecting a random subset 
of variables, no two of which appear in the same clause at each iteration, and updating according 
to Eqs. (30), (31), (32), and (33). After convergence the Lagrange multipliers are updated. The 
initial temperature is 10 _1 . We plot the number of constraints violated in the most probable 
state of each mixture as a function of the number of updates, as well as the expected number of 
violated constraints. After 8000 steps three distinct solutions have been found along with a fourth 
configuration which violates a single constraint. 

7.2 Minimization of NK Functions 

The NK model defines a family of tunably difficult optimization problems (Kauffman &; Levin 
1987). The objective of N binary variables is defined as the average of N contributions local 
to each variable z t and involving 0 > K > N — 1 other randomly chosen variables z\ ■ ■ ■ zf ~ : 
G( z) = N~ 1 J2iLiEi(zi-,zj,---zf). For each of the 2 A+1 local configurations is assigned a 
value drawn uniformly from 0 to 1. K controls the number of local minima; under Hamming 
neighborhoods K = 0 optimization landscapes have a single global optimum and K = N — 1 
landscapes have on average 2 N /{N + 1) local minima. Further properties of NK landscapes may 
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Figure 5: (a) The solid curves show the number of unsatisfied clauses in the most probable config- 
uration z mp of each of the 4 mixtures vs iterations. The topmost solid black line plots the expected 
number of violations, and the dashed black line shows the approximation to the JS distance, (b) 
The solid curves show the evolution of the G value of the best z mp configurations for each of 5 mix- 
tures versus number of iterations. The dashed black line shows the corresponding approximation 
to the JS distance. 

be found in (Durrett &; Limic 2001). Fig. 5(b) plots the energy of a 5 mixture model for a multi- 
modal N = 300 K = 2 function. The K — 1 spins other than i upon which £) depends were selected 
at random. At termination of the PC algorithm (at a small but non-zero temperature), five distinct 
configurations are obtained with the nearest pair of solutions having Hamming distance 12. Note 
that unlike the k sat problem which has multiple configurations all having the same global minimal 
energy, the JS distance (the dashed curve) of Fig. 5(b) drops to zero as the temperature decreases. 
This is because at exactly zero temperature there is no term forcing different solutions, and the 
Lagrangian is minimized by having all mixtures converge to delta functions at the lowest objective 
configuration. 


8 Relation of PC to other work 

There has been much work from many fields related to PC. The maxent Lagrangian has been used 
in statistical physics for over a century under the rubric of “free energy” . Its derivation in terms of 
information theoretic statistical inference was by Jaynes (Jaynes 1957). The maxent Lagrangian has 
also appeared occasionally in game theory as a heuristic, without a statistical inference justification 
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(be it information-theoretic or otherwise) (Fudenberg & Levine 1998, Shamma & Arslan 2004). 18 
In none of this earlier work is there an appreciation for its relationship with the related work in 
other fields. 

In the context of distributed control/optimization, the distinguishing feature of PC is that it 
does not view the variable x as the fundamental object, but rather the distribution across it, q. 
Samples of that distribution are not the direct object of interest, and in fact are only used if 
necessary to estimate functionals of q. The fundamental objective function is stated in terms of q. 
As explicated in the next subsection, the associated optimization algorithms are related to some 
work in several fields. Heretofore those fields have been unaware of each other, and of the breadth 
of their relation to information theory and game theory. 

Finally, we note that the rnaxent or qp Lagrangian C(q) = E q (G) — TS(q) can be viewed as 
a barrier-function (interior point) method with objective E q (G). An entropic barrier function is 
used to enforce the constraints qi{xi) > 0 Vi and Xi, with the constraint that all % sum to 1 being 
implicit. 

8.1 Various schemes for updating q 

We have seen that the qp Lagrangian is minimized by the product distribution q given by 


Qi(xi) oc exp(-E q _ i (G\x i ) /T) . (34) 

Direct application of these equations that minimize the Lagrangian form the basis of the Brouwer 
update rules. Alternatively, steepest descent of the rnaxent Lagrangian forms the basis of the 
Nearest Newton algorithm. These update rules have analogues in conventional (non-PC) opti- 
mization. For example, Nearest-Newton is based on Newton’s method, and Brouwer updating is 
similar to block-relaxation. This is one of the advantages of embedding the original optimization 
problem involving x in a problem involving distributions across x: it allows us to solve problems 
over non- Euclidean (e.g., countable) spaces using the powerful methods already well- understood 
for optimization over Euclidean spaces. 

However there are other PC update rules that have no direct analogue in such well- understood 
methods for Euclidean space optimization. Algorithms may be developed that minimize the pq 
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Lagrangian D(pt\\(j) where pt( x ) = e x p(— G(x)/T)/Z(T) with Z(T) being the normalization of 
the Boltzmann distribution. The pq Lagrangian is minimized by the the product of the marginals of 
the Boltzmann distribution, i.e. qi{xf) = f dx_i pr(*)- Another example of update rules without 
Euclidean analogues are the iterative focusing update rules described in (Wolpert 20046). Iterative 
focusing updates are intrinsically tied into the fact that we’re minimizing (the distribution setting) 
an expectation value. 

A subset of update rules arising from qp and pq Lagrangians are described in (Wolpert 20046). 
In all cases, the updates may be written as multiplicative updating of q. The following is a list of 
the update ratios r q) i{xi) = q^ +1 (xi) / q\{xf) of some of those rules. In all of these, Fq is a probability 
distribution over x that never increases between two x’s if G does (e.g., a Boltzmann distribution 
in G'(.t)). In addition, const is a scalar that ensures the new distribution is properly normalized 
and a is a stepsize. 19 
Gradient descent of qp distance to Fq: 

rE,t(lnF G |®i)+ln(?i(*i)h const 

1- “l WF rwF (35) 

Nearest Newton descent of qp distance to Fq: 

1 — a{~K q t (In Fc\xi) + \n(qj(xi ))} — const (36) 


Brouwer updating for qp distance to Fq • 

ffiqt (In Fa\xi) 
const x r- — 


Importance sampling minimization of pq distance to Fq: 


const x E q t 



(37) 


(38) 
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Iterative focusing of q with focusing function Fq using qp distance and gradient descent: 

j-EgtllnFoIXi) + ln||^ ^ const 

7W) I'FF) (39) 

Iterative focusing of q with focusing function Fq using qp distance and Nearest Newton: 

1 — a^K q t(\n.Fc\xi) + In | — const (40) 

^ Qi x^i ) ' 

Iterative focusing of q with focusing function Fq using qp distance and Brouwer updating: 

const x x ( 41 ) 

Qi(xi) 

Iterative focusing of q with focusing function Fq using pq distance: 

const x Eq(F G \xi) x (42) 

Note that some of these update ratios are themselves proper probability distributions, e.g., the 
Nearest Newton update ratio. 

This list highlights the ability to go beyond conventional Euclidean optimization update rules, 
and is an advantage of embedding the original optimization problem in a problem over a space of 
probability distributions. Another advantage is the fact that the distribution itself provides much 
useful information (e.g., sensitivity information). Yet another advantage is the natural use of Monte 
Carlo techniques that arise with the embedding, and allow the optimization to be used for adaptive 
control. 

There are also overlaps of PC with evolutionary approaches to optimization (e.g., genetic algo- 
rithms). Many techniques in the evolutionary computation community use Boltzmann distributions 
in ad hoc ways to update a “population” of x’s, e.g., “truncation selection” and “Boltzmann selec- 
tion” (G.Ficici, Melnik, & Pollack 2000). These rules are not formally derived, and do not directly 
concern themselves with distributions q. However some of them are similar to the PC update 
rules, in particular iterative focusing rules, only applied to sets of Monte Carlo sample points (the 


43 



population) rather than to q. There are other techniques which also use Boltzmann distributions 
and samples, although without a multi-member population, e.g., simulated annealing. 

These early techniques do not consider the underlying distribution that gets sampled to produce 
the population. Such consideration was introduced in PBIL (Baluja 2002), MIMIC (Bonet, Jr. & 
Viola 1996) and other EDA algorithms (Larraaga & Lozano 2001), followed shortly by the powerful 
CE method (Rubinstein & Kroese 2004). 

However while considering distributions, none of this early work casts the objective as a min- 
imization of a functional of that distribution. Accordingly, all the power arising from minimizing 
Euclidean vectors is absent in this work. There is none of the second order methods, difference 
utilities, or data-ageing that appear to be crucial for very large problems. Nor does this earlier 
work exploit semicoordinate transformations (the topic of this paper), oracle-based methods (Lee 
& Wolpert 2004), etc. Despite this, the results in (Rubinstein & Kroese 2004) in particular are 
compelling. (Note that the CE method, as applied, is identical to Eq. (42), although with a different 
justification.) 

There is other previous work on optimization that has directly considered the distribution q as 
the object of interest. In particular deterministic annealing (Duda et al. 2000) is “bare-bones” 
parallel Brouwer updating. This involves no data-aging (or any other scheme to avoid thrashing of 
the agents), difference utilities, etc.. 20 

Most tantalizingly, probability matching (Sabes &; Jordan 1996) uses Monte Carlo sampling to 
optimize a functional of q. However this work was in the context of a single agent, did not exploit 
techniques like data-ageing, and was not pursued. 

Other work has both viewed q as the fundamental object of interest and used techniques like 
data-aging and difference utilities (Wolpert, Turner &; Frank 1999, Wolpert et al. 2003, Wolpert 
2003 c). However this work was not based on information-theoretic considerations and had no 
explicit objective function for q. It was the introduction of such considerations that resulted in PC. 

Finally, shortly after the introduction of PC a variant of its Monte Carlo version of parallel 
Brouwer updating has been introduced, called the MCE method (Rubenstein 2005). In this variant 
the annealing of the Lagrangian doesn’t involve changing the temperature /3, but instead changing 
the value of a constraint specifying E q (G). Accordingly, rather than jump directly to the (/3- 
specified) solution given above, one has to solve a set of coupled nonlinear equations relating all 
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the q t . (Another distinguishing feature is no data-ageing, difference utilities or the like are used in 
the MCE method.) The MCE method has been justified with the KL argument reviewed above 
rather than with “ratchet” -based maximum entropy arguments. This has redrawn attention to the 
role of the argument-ordering of the KL distance, and how it relates Brouwer updating and the CE 
method. 

9 Conclusion 

A distributed constrained optimization framework based on probability collectives has been pre- 
sented. Motivation for the framework was drawn from an extension of full-rationality game theory 
to bounded rational agents. An algorithm that is capable of obtaining one or more solutions simul- 
taneously was developed and demonstrated on two problems. The results show a promising, highly 
distributed, off-the-shelf approach to constrained optimization. 

There are many avenues for future exploration. Alternatives to the Lagrange multiplier method 
used here can be developed for constraint satisfaction problems. By viewing the constraints as 
separate objectives, a Pareto-like optimization procedure may be developed whereby a gradient 
direction is chosen which is constrained so that no constraints are worsened. This idea is motivated 
by the highly successful WalkSAT (Selman, Kautz & Cohen 1996) algorithm for A;- sat in which 
spins are flipped only if no previously satisfied clause becomes unsatisfied as a result of the change. 

Probability collectives also offer promise in devising new methods for escaping local minima. 
Unlike traditional optimization methods where monotonic transformations of the objective leave lo- 
cal minima unchanged, such transformations will alter the local minima structure of the Lagrangian. 
This observation, and alternative Lagrangians (see (Rubinstein 2001) for a related approach using 
a different minimization criterion) offer new approaches for improved optimization. 
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Notes 

x |5| denotes the number of elements in the set S. 

2 In this paper vectors are indicated in bold font and scalars are in regular font. 
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3 Despite its popularity, the KL distance D(q\\p) between two probability distributions q and p is not a proper 
metric. It is not even symmetric between q and p. However it is non-negative, and equals zero only when q = p. 

4 Here and throughout, we fix the convention that it is desirable to minimize objective functions, not to maximize 
them. 

5 We adopt the notation that q~i indicates the distribution q with variable i marginalized out, i.e., the product 
n^, Qi- Analogously, x_i = [aq, • • • , Xi-i,x i+ i,- • • , *„] 

6 Properly speaking one should find the global minimizer q. Here we content ourselves with finding local minima. 

7 N.b., we do not project onto Q but rather add a vector to get back to it. See (Wolpert & Bieniawski 20046). 

s For such a distribution we relax the requirement of being a product or having any other particular form; we only 
require that all 0 < 7r(x) < 1 and 7r(x) = 1. 

9 Quadratic loss can be roughly related to the assumption that the Lagrangian is locally well-approximated by a 
paraboloid. We want to estimate f using an estimator f(n) which is a function of prior information n. If we penalize 
errors quadratically (i.e. the error incurred when the estimate is f and the true value is f is Err(f , f) = ||f — f || 2 ) , then 
the expected error conditioned on n is Ef| n (Err) = J df P(f|n) ||f — f || 2 = ||f || 2 — 2f T Ef| n (f) +Ef|,j(||f || 2 ). Minimizing 
this with respect to the estimator f gives the optimal quadratic loss estimator f(n) = Ef|„(f) Uncertainty in f arises 
due to uncertainty in and hence we integrate over the unknown q. 

10 The bias-variance decomposition decomposes the error of an estimator into two contributions. Again let / be the 
true quantity and let f(n ) be an estimator of / formed from data n. We are interested in the error Err(/, /) = (/ — f) 2 
when averaged over different data sets n so we consider E/ jT j (Err) = f dndf P(n, /)(/ — f) 2 . By adding and subtracting 
E /(/) -En(/) we have E/, n (Err) = f dndf P(n, /)({/ — E n (/)} + (E/(/) - /} + (E„(/) -E/(/)}) 2 . The cross term 
vanish when integrating, and we are left with E/ !TO (Err) = E n ([/ — E n (/)] 2 ) + E / ([/ — E/(/)] 2 ) + [E/(/) — E n (/)] 
The first term is the variance of the estimator across different data sets. The second term is the noise inherent in the 
quantity being estimated which is independent of the estimator, and which we will therefore ignore. The final term 
is the square of the bias of the estimator. 

11 Another alternative is to use data-ageing so that data from preceding Monte Carlo blocks can be re-used. Yet 
another alternative is to force a sample for each Xi value that has no samples at the end of the block, x\. This is 
done by drawing a random sample of the distribution 5(xi — x' i )q-i{'x.-i). 

12 To Monte Carlo estimate a conditional expected utility for agent i it is crucial that x_i be generated by sampling 
the associated distribution g_;. However the value Xi can be set in any fashion whatsoever. 

13 Just like AU, less sophisticated versions of WLU were previously explored under the same name, with examples 
again arising in (Wolpert & Turner 2001, Wolpert, Wheeler & Turner 2000) and references therein. 

14 To see this, say we add a function T‘(x_i) to G(x) both places G occurs in the Nearest Newton update rule. 
Then since qi(xi) is evaluated exactly, the data average of the empirical estimates of those two terms involving D 1 
both give the expectation value a t qj(xi ) E t ,(D X ) exactly. Accordingly, those two terms cancel. 

15 The Jacobian factor is irrelevant as £ is a permutation. 

16 Note that if Wi(zi\m) = l/\Zi\ is uniform across Zi then = l/\Zi\ and B" 1 ’ 171 = — ln|.2j|. Maximizing over 
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v m we find that J(q, w = \/\Z\, v = v*) = 0. Thus, maximizing with respect to w increases the JS distance from 0. 

11 In determining the density 10 4 samples were drawn from q( z) with Gaussians centered at each value of G(z, 1) 
and with the width of all Gaussians determined by cross validation of the log likelihood. The fact that there is 
non-zero probability of obtaining non-integral numbers of constraint violations is an artifact of the finite width of the 
Gaussians. 

ls However an attempt at a first-principles derivation can be found in (Meginniss 1976). While fascinating, this 
attempt is flawed (D. Luce, private communication). 

19 As a practical matter, both Nearest Newton and gradient-based updating have to be modified in a particular 
step if their step size is large enough so that they would otherwise take one off the unit simplex. This changes the 
update ratio for that step. See (Wolpert & Bieniawski 20046). 

20 Indeed, as conventionally cast, deterministic annealing assumes the conditional G can be evaluated in closed 
form, and therefore has no concern for Monte Carlo sampling issues. 
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